Sampling from naturally truncated power laws: The matchmaking paradox 
Consider a network of $M \gg 1$ nodes connected by $N \gg 1$ links, in which the distribution of the
number of links per node follows a power law $P(n) \sim n^{-1-\alpha}$ with exponent $0 < \alpha < 1$. The power
law is naturally truncated due to the fact that $N$ is finite. A subset of $m \ll M$ nodes is sampled
arbitrarily, yielding the sample mean $\eta$: the average number of links per node within the sampled
subset. We explore the statistics of the sample mean $\eta$ and show that its fluctuations around the
population mean $\nu = N/M$ are extremely broad and strongly skewed - yielding typical values which
are systematically and significantly smaller than the population mean $\nu$. Applying these results to
the case of bipartite networks, we show that the sample means of the two parts of these networks
generally differ - a fact we call the "matchmaking paradox" in the title.

PACS numbers: 02.50.-r; 64.60.aq 
In this letter we address the problem of sampling random networks with naturally truncated
power-law distributions of the number of links. Consider a network consisting of $M \gg 1$ nodes
connected by $N \gg 1$ links. Imagine that in a very large population ($M \to \infty$, $N \to \infty$,
and $N/M \to \nu = \mathrm{const}$) the distribution of the number of links tends to a power law

$$P(n) \sim n^{-1-\alpha}$$

for $n$ large enough (with a normalization coefficient which might depend on the actual network
size). In a finite population, the finite mean $\nu = N/M$ implies that the distribution of $n$ is
truncated at some value: This is what we term a naturally truncated power-law distribution.
Natural truncation has to be distinguished from the finite size effects in growing networks
(see e.g. [1] and references therein).

Imagine moreover that - as is always done in statistical investigations - a random sample of
$m < M$ nodes is drawn from the overall population of $M$ nodes. If the corresponding power-law
exponent is in the range $0 < \alpha < 1$, then the mathematical expectation $\sum_{n=0}^{\infty} n P(n)$
diverges - and hence fails to coincide with the population mean $\nu$. In such a situation we
inquire into the distribution of the sample mean $\eta = \frac{1}{m} \sum_{i=1}^{m} n_i$, where $n_i$ denotes
the number of links connected to the $i$-th node of the sample. In particular, it is of interest
to know whether the sample mean $\eta$ is typically larger or smaller than the population mean
$\nu$, and how its statistics change as the sample size $m$ is increased.
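
To see why the truncation point matters for $0 < \alpha < 1$, the following minimal sketch evaluates the mean of a pure power law cut off at a maximal value. The function name `truncated_mean`, the lower bound $n = 1$ and the hard cutoff are assumptions made for illustration only; they are not the model analyzed in this Letter.

```python
def truncated_mean(alpha, cutoff):
    """Mean of P(n) proportional to n**(-1 - alpha) on n = 1, ..., cutoff."""
    values = range(1, cutoff + 1)
    weights = [n ** (-1.0 - alpha) for n in values]
    norm = sum(weights)
    return sum(n * w for n, w in zip(values, weights)) / norm


if __name__ == "__main__":
    for cutoff in (10**2, 10**3, 10**4):
        # For alpha < 1 the mean keeps growing with the cutoff;
        # for alpha > 1 it saturates.
        print(cutoff, truncated_mean(0.5, cutoff), truncated_mean(2.5, cutoff))
```

In this toy setting the mean is set by the cutoff rather than by the bulk of the distribution, which is the reason a small sample can be expected to underestimate it.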

The aforementioned problem is related to the "Levy matchmaking" problem. Imagine two sets of
$M \gg 1$ nodes (the red and the blue nodes, or boys and girls). The nodes of the two sets are
connected by $N \gg 1$ links, each having a red node on one side and a blue one on the other
side. Although the number of links is the same when seen from the red and from the blue side,
the distributions of the number of links attached to a red node and to a blue node differ. In
a very large population ($M \to \infty$, $N \to \infty$, and $N/M \to \nu = \mathrm{const}$) they would follow

$$P_{\mathrm{red}}(n) \sim n^{-1-\alpha_1}, \qquad P_{\mathrm{blue}}(n) \sim n^{-1-\alpha_2}$$

for $n$ large enough (with, in general, exponents $\alpha_1 \neq \alpha_2$). In such a situation - when
sampling from the red and blue populations - how different can the red and the blue sample
means be?

The motivation for the Levy matchmaking problem is as follows. In the mid eighties several
research groups were investigating the distribution of the number of sexual partners in
different human populations, prompted by the necessity to identify the risk groups in the
AIDS epidemic. A "Nature" editorial by Maddox contained the statement [2]: "The figures so far
show that the average number of heterosexual partners of men in the course of a lifetime is
11.0 and of women 2.9". In response to Maddox's editorial, Gurman published a note explaining
the nonsense of having different means in the two populations connected by well-defined
one-to-one links [3]: "A heterosexual union is analogous to a heteronuclear chemical bond, and
the total number must be the same if viewed from the male or female end".

This situation is more profound than it seemed to be. The empirical distribution of the
number of partners is long-tailed and follows a power law, and its mathematical expectation
may diverge. Thus for exponents in the range $0 < \alpha < 1$ the sample means depend
systematically on the sample size $m$ and therefore have to differ for small samples in order
to match each other for the population as a whole. This point is what we refer to as the
"matchmaking paradox" in the title of this Letter. For exponents in the range $\alpha > 1$ this
is no longer the case, and the sample means have to match. To the best of our knowledge, this
exponent-dependency aspect of the problem was never considered in detail (probably due to the
lack, at that time, of an adequate mathematical toolbox). Moreover, the problem has much in
common with other situations of weak ergodicity breaking. Indeed, sample means that "normally"
should be the same actually differ, since one of them never reaches a sharp value but shows
universal fluctuations.

Later investigations [B|, [f| have shown that power-laws 
in a heterosexual population have exponents in the range 
$\alpha > 1$, implying that the reason for sample-mean deviations
should be looked for elsewhere (see e.g. Ref. 0]).
Nonetheless, both the problems of sampling and match- 
making are of considerable interest - especially taking 
into account the overall importance of the sampling pro- 
cedures in networks [H, as well as the fact that the dis- 
tribution of the number of contacts in homosexual males 
follows a power-law with exponent $\alpha \approx 0.6$, Ref.Q.

The main issues explored in this research are the following: What is the distribution of the
sample mean $\eta$ calculated for a sample of size $m \gg 1$? And how does the sample mean $\eta$
relate to the population mean $\nu$? These issues are intimately connected to the statistics of
Levy random probabilities, studied in Ref. 0] - but have several unique aspects which are
worth a separate and detailed investigation.

We follow Gurman's setup with a static, finite, bipartite population. To begin with, we
establish a model yielding naturally truncated power-law distributions (of the links).
Consider a large population consisting of $2M \gg 1$ nodes - $M$ "red" and $M$ "blue" - and
$N \gg 1$ links connecting the red and blue nodes. Each node has an "attractiveness" level:
Each red node $i$ (blue node $j$) has an attractiveness level $f_i$ ($g_j$) chosen at random from
a one-sided Levy distribution with exponent $\alpha_1$ ($\alpha_2$). Each link connects - on each
red/blue side - to a single node, the probability of connecting being proportional to the
attractiveness levels. Hence, the probabilities $\phi_i$ and $\gamma_j$ that the ends of a given link
are connected to the red node $i$ and to the blue node $j$ are given by



$$\phi_i = \frac{f_i}{\sum_{k=1}^{M} f_k}, \qquad \gamma_j = \frac{g_j}{\sum_{k=1}^{M} g_k}.$$
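
The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the one-sided Levy variables are drawn with Kanter's representation (Laplace transform $\exp(-s^\alpha)$), and the helper names `one_sided_levy`, `connection_probabilities` and `assign_links` are hypothetical.

```python
import math
import random


def one_sided_levy(alpha, rng):
    """One-sided Levy variable, 0 < alpha < 1, via Kanter's representation
    (an assumption of this sketch; any exact sampler would do)."""
    u = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    a = math.sin(alpha * u) / math.sin(u) ** (1.0 / alpha)
    b = (math.sin((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return a * b


def connection_probabilities(levels):
    """Normalize attractiveness levels into connection probabilities."""
    total = sum(levels)
    return [f / total for f in levels]


def assign_links(probabilities, n_links, rng):
    """Attach one end of each of n_links links; return link counts per node."""
    counts = [0] * len(probabilities)
    nodes = range(len(probabilities))
    for node in rng.choices(nodes, weights=probabilities, k=n_links):
        counts[node] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    M, N, alpha_1 = 200, 2000, 0.6
    f = [one_sided_levy(alpha_1, rng) for _ in range(M)]
    phi = connection_probabilities(f)
    n_red = assign_links(phi, N, rng)
    print(max(n_red), sum(n_red) / M)  # largest degree and population mean N/M
```

The blue side would be handled the same way with its own levels $g_j$ and exponent $\alpha_2$; both sides then share the same total number of link ends, $N$.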



Let us first concentrate on the red side of the network. As a statistical sample we choose
at random a set of $m < M$ of the red nodes. The probability that a given link is connected
to one of the sample nodes is given by



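$$z = \sum_{i=1}^{m} \phi_i = \frac{\sum_{i=1}^{m} f_i}{\sum_{j=1}^{M} f_j} = \frac{\sum_{i=1}^{m} f_i}{\sum_{i=1}^{m} f_i + \sum_{i=m+1}^{M} f_i} = \frac{1}{1 + Y/X},$$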



where $X = \sum_{i=1}^{m} f_i$ and $Y = \sum_{i=m+1}^{M} f_i$ are independent one-sided Levy variables
with exponent $\alpha = \alpha_1$, and with scaling parameters $m^{1/\alpha}$ and $(M-m)^{1/\alpha}$. The value
of $z$ - the Levy random probability - is thus a random variable which coincides in
distribution with

$$z = \frac{1}{1 + (M/m - 1)^{1/\alpha} R},$$

where $R$ is the quotient of two independent one-sided Levy variables with exponent
$\alpha = \alpha_1$. Henceforth, we set the shorthand notation $x = (M/m - 1)^{1/\alpha}$. Note that the
random variable $z$ admits values in the unit interval (0, 1). Moreover, we note that even if
the distributions of the attractiveness levels $f_i$ deviate from the one-sided Levy - but yet
possess power-law asymptotics with exponent $\alpha$ - then the distribution of $z$ for
$m, M \gg 1$ is universal (in the sense of the corresponding limit theorem). Hence, our
analysis does not depend on the precise form of the distributions of the attractiveness
levels $f_i$. We further note that the introduction of the attractiveness levels was only a
convenient intermediate step, and that the discussion to follow holds for any kind of
naturally truncated power-law distributions with exponents in the range $0 < \alpha < 1$.
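
The equality in distribution between the direct definition of $z$ and the representation $1/(1 + xR)$ can be probed numerically. The sketch below is illustrative only: it assumes Kanter's representation for the one-sided Levy variables, and the function names are hypothetical. Both samplers should place about half of their draws above $1/(1 + x)$, since the median of $R$ is unity.

```python
import math
import random


def one_sided_levy(alpha, rng):
    """One-sided Levy variable (Laplace transform exp(-s**alpha)), Kanter's form."""
    u = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    a = math.sin(alpha * u) / math.sin(u) ** (1.0 / alpha)
    b = (math.sin((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return a * b


def levy_random_probability(alpha, m, M, rng):
    """Draw z = 1 / (1 + x R) with x = (M/m - 1)**(1/alpha)."""
    x = (M / m - 1.0) ** (1.0 / alpha)
    r = one_sided_levy(alpha, rng) / one_sided_levy(alpha, rng)
    return 1.0 / (1.0 + x * r)


def direct_probability(alpha, m, M, rng):
    """Draw z as the share of the first m out of M attractiveness levels."""
    f = [one_sided_levy(alpha, rng) for _ in range(M)]
    return sum(f[:m]) / sum(f)


if __name__ == "__main__":
    rng = random.Random(42)
    alpha, m, M = 0.5, 10, 100
    z_ref = 1.0 / (1.0 + (M / m - 1.0) ** (1.0 / alpha))
    for sampler in (levy_random_probability, direct_probability):
        zs = [sampler(alpha, m, M, rng) for _ in range(2000)]
        print(sampler.__name__, sum(z > z_ref for z in zs) / len(zs))
```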

The probability density function (pdf) of the quotient $R$ is known 0]: Its Laplace transform
is a Mittag-Leffler function, $\mathcal{L}\{p_R(R)\} = E_\alpha(-u^\alpha)$, with $u$ denoting the Laplace
variable. The asymptotic behavior of $p_R$ for $R$ large and small is obtained via Tauberian
theorems from the asymptotics of the Mittag-Leffler function. Thus, for $R$ large we have

$$p_R(R) \simeq \frac{1}{\Gamma(\alpha)\Gamma(1-\alpha)} R^{-1-\alpha}, \tag{1}$$

where $\Gamma(\cdot)$ is the Gamma function.

Let $h = \sum_{i=1}^{m} n_i$ denote the number of "hits" in the sample. Given the value $z$ of the
probability of connecting to one of the sample nodes, the probability that $h$ of the $N$ links
"hit" the sample is given by the conditional binomial distribution

$$p(h|z) = \frac{N!}{h!(N-h)!} z^h (1-z)^{N-h}.$$

Hence, the unconditional probability distribution of $h$ is given by

$$p_h(h) = \int_0^1 \frac{N!}{h!(N-h)!} z^h (1-z)^{N-h} p_z(z)\, dz.$$



For $N \gg 1$ the binomial distribution is actually extremely narrow: Its standard deviation
is much smaller than its mean, so that $[N!/h!(N-h)!]\, z^h (1-z)^{N-h} \approx \delta(h - Nz)$. Thus we
can take $h = Nz$; the distribution of $h$ follows from that of the Levy random probability $z$
by a change of variables. The distribution of the sample mean $\eta = h/m = Nz/m$, in turn, is
given by

$$p_\eta(\eta) = \frac{m}{N}\, p_z\!\left(\frac{m}{N} \eta\right).$$

This fact can be proved by explicit calculation of the generating function of the
distribution $p_h(\cdot)$ - evaluating it in the range $1 \ll h \ll N$ via Tauberian theorems.
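
How narrow the conditional binomial distribution is can be quantified by its relative width, $\sqrt{(1-z)/(Nz)}$, which is small whenever $Nz \gg 1$. The short sketch below evaluates this ratio and the resulting deterministic map from $z$ to the sample mean; the function names are hypothetical and the numbers are arbitrary illustrations.

```python
import math


def binomial_relative_width(N, z):
    """Standard deviation divided by mean for a Binomial(N, z) count."""
    mean = N * z
    std = math.sqrt(N * z * (1.0 - z))
    return std / mean


def sample_mean_from_z(N, m, z):
    """Sample mean eta = h/m under the sharp approximation h = N z."""
    return N * z / m


if __name__ == "__main__":
    N, m = 10**6, 10
    for z in (1e-1, 1e-2, 1e-3):
        # The relative width shrinks as N z grows.
        print(z, binomial_relative_width(N, z), sample_mean_from_z(N, m, z))
```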

Note that for $M \to \infty$ and $m \ll M$, $p_z(z)$ practically follows the distribution of
$M^{-1/\alpha} R$, and is a power law. Taking $m = 1$ we arrive at the (continuous approximation
for the) distribution of the number of links per node. The power law spreads over the domain
$1 \ll h \ll N$ and is truncated for $h > N$, as is evident from the fact that $p_z(z)$ vanishes
for $z > 1$. The sample mean $\eta$ is therefore a random variable, and the properties of its
distribution are discussed below.

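The mathematical expectation $\langle \eta \rangle$ of the sample mean $\eta$ is equal to the population
mean $\nu$. Indeed,

$$\langle \eta \rangle = \frac{N}{m} \langle z \rangle$$

and

$$\langle z \rangle = \int_0^\infty z(R)\, p_R(R)\, dR = \int_0^\infty \frac{1}{1 + xR}\, p_R(R)\, dR. \tag{2}$$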



Noting that $1/(1 + xR) = x^{-1} (1/x + R)^{-1}$ and substituting the integral representation

$$\frac{1}{1/x + R} = \int_0^\infty e^{-u/x} e^{-uR}\, du \tag{3}$$



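into Eq. (2) - while interchanging the order of integration - yields:

$$\langle z \rangle = \frac{1}{x} \int_0^\infty du\, e^{-u/x} \int_0^\infty dR\, e^{-uR} p_R(R) = \frac{1}{x} \int_0^\infty du\, e^{-u/x} E_\alpha(-u^\alpha). \tag{4}$$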



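The integral on the right-hand side of Eq. (4) is the Laplace transform of this
Mittag-Leffler function, evaluated at $s = 1/x$. This Laplace transform is known to be given
by $\mathcal{L}\{E_\alpha(-u^\alpha)\} = s^{\alpha-1}/(s^\alpha + 1)$, and hence setting $s = 1/x$ we arrive at

$$\langle z \rangle = \frac{1}{x^\alpha + 1}.$$

Finally, recalling that $x = (M/m - 1)^{1/\alpha}$ we obtain that $\langle z \rangle = m/M$ and

$$\langle \eta \rangle = \frac{N}{m} \frac{m}{M} = \frac{N}{M} = \nu.$$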
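
The result $\langle z \rangle = m/M$ can be checked by a quick Monte Carlo average of $1/(1 + xR)$. The sketch below is a rough numerical illustration, assuming Kanter's representation for the one-sided Levy variables; the function names, the seed and the sample size are arbitrary choices.

```python
import math
import random


def one_sided_levy(alpha, rng):
    """One-sided Levy variable (Laplace transform exp(-s**alpha)), Kanter's form."""
    u = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    a = math.sin(alpha * u) / math.sin(u) ** (1.0 / alpha)
    b = (math.sin((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return a * b


def mean_z_monte_carlo(alpha, m, M, n_samples, rng):
    """Monte Carlo estimate of <z> with z = 1 / (1 + x R)."""
    x = (M / m - 1.0) ** (1.0 / alpha)
    total = 0.0
    for _ in range(n_samples):
        r = one_sided_levy(alpha, rng) / one_sided_levy(alpha, rng)
        total += 1.0 / (1.0 + x * r)
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(7)
    alpha, m, M = 0.5, 10, 100
    print(mean_z_monte_carlo(alpha, m, M, 20000, rng), m / M)
```

Because $z$ is bounded, its average converges quickly even though the underlying Levy variables have no finite mean; multiplying by $N/m$ gives the corresponding estimate of $\langle \eta \rangle$.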

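The distribution of the sample mean $\eta$, however, is extremely broad - as seen from its
variance. To calculate the variance we note that $\langle \eta^2 \rangle = (N^2/m^2) \langle z^2 \rangle$ and

$$\langle z^2 \rangle = \int_0^\infty z^2(R)\, p_R(R)\, dR = \int_0^\infty \frac{1}{(1 + xR)^2}\, p_R(R)\, dR.$$

Using the fact that $(1 + xR)^{-2} = \frac{d}{dx} (1/x + R)^{-1}$ and the integral representation
given by Eq. (3) we get:

$$\langle z^2 \rangle = \frac{d}{dx} \frac{x}{1 + x^\alpha} = \left(\frac{m}{M}\right)^2 \left[1 + (1 - \alpha) \left(\frac{M}{m} - 1\right)\right].$$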



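From this we obtain that the variance of $\eta$ is given by:

$$\sigma^2 = \langle \eta^2 \rangle - \nu^2 = (1 - \alpha) \frac{N^2}{M^2} \left(\frac{M}{m} - 1\right) \simeq (1 - \alpha)\, \nu^2 \frac{M}{m}.$$

Hence, the standard deviation $\sigma$ of $\eta$ is of the order of magnitude $\nu \sqrt{M/m}$ with
$\sqrt{M/m} \gg 1$ - i.e., far larger than its mean $\langle \eta \rangle$. Therefore, it is highly improbable
to obtain an accurate estimate of the population mean $\nu$ from a sample with size $m$ much
smaller than the population size $M$.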
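
The moment formulas can be cross-checked numerically: the second moment of $z$ should equal the derivative of $x/(1 + x^\alpha)$ with respect to $x$, and the ratio $\sigma / \nu$ should equal $\sqrt{(1-\alpha)(M/m - 1)}$. The sketch below does this with a central finite difference; the function names and the step size are hypothetical choices for illustration.

```python
import math


def mean_z(x, alpha):
    """First moment <z> = 1 / (1 + x**alpha)."""
    return 1.0 / (1.0 + x ** alpha)


def second_moment_z(x, alpha):
    """Second moment <z^2> = (1 + (1 - alpha) x**alpha) / (1 + x**alpha)**2."""
    xa = x ** alpha
    return (1.0 + (1.0 - alpha) * xa) / (1.0 + xa) ** 2


def relative_std_eta(alpha, m, M):
    """Standard deviation of eta in units of the population mean nu."""
    return math.sqrt((1.0 - alpha) * (M / m - 1.0))


if __name__ == "__main__":
    alpha, m, M = 0.5, 10, 1000
    x = (M / m - 1.0) ** (1.0 / alpha)
    step = 1e-3 * x
    # Central finite difference of x * <z> with respect to x
    numeric = ((x + step) * mean_z(x + step, alpha)
               - (x - step) * mean_z(x - step, alpha)) / (2.0 * step)
    print(numeric, second_moment_z(x, alpha), relative_std_eta(alpha, m, M))
```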

Not only is the distribution of $\eta$ extremely broad - it is also extremely skewed. As we
now proceed to show, the median of $\eta$ lies far below its mathematical expectation
$\langle \eta \rangle = \nu$, and finding values of $\eta$ which are larger than its mathematical expectation
$\langle \eta \rangle = \nu$ is highly improbable. Hence, a typical result of a statistical measurement of
$\eta$ will be much smaller than the population mean $\nu$.

Since $z$ and therefore $\eta$ are monotonic functions of $R$, their medians follow from the
median of $R$. The random variable $R$ is a quotient of two identically distributed random
variables - hence the distribution of $R$ is the same as the distribution of $1/R$. The random
variable $\ln(R)$ is therefore symmetric and, consequently, its median is zero - implying, in
turn, that the median of $R$ is unity: $R_{1/2} = 1$. Substituting the median $R_{1/2} = 1$ into the
expressions for $z$ and $\eta = (N/m) z$ we obtain that the median $\eta_{1/2}$ of the sample mean
$\eta$ is given by:

$$\eta_{1/2} = \frac{N}{m}\, \frac{1}{1 + (M/m - 1)^{1/\alpha}} \tag{5}$$

(equation (5) holding for all $m \ll M$). Clearly, the median $\eta_{1/2}$ is much smaller than
the population mean $\nu$.
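
As a rough illustration of Eq. (5), the sketch below evaluates the median of the sample mean and inverts the relation to extrapolate the population mean from an observed sample mean. Treating a single observed $\eta$ as if it were the median is an assumption of this sketch, and the function names are hypothetical.

```python
def median_sample_mean(N, M, m, alpha):
    """Median of eta: (N/m) / (1 + (M/m - 1)**(1/alpha))."""
    x = (M / m - 1.0) ** (1.0 / alpha)
    return (N / m) / (1.0 + x)


def extrapolate_population_mean(eta, M, m, alpha):
    """Invert the median relation, treating the observed eta as the median."""
    x = (M / m - 1.0) ** (1.0 / alpha)
    return eta * (m / M) * (1.0 + x)


if __name__ == "__main__":
    N, M, alpha = 10_000, 1_000, 0.5
    for m in (10, 100, 500, 1000):
        eta_med = median_sample_mean(N, M, m, alpha)
        # The median approaches nu = N / M only as m approaches M.
        print(m, eta_med, extrapolate_population_mean(eta_med, M, m, alpha))
```

Given the very broad distribution of $\eta$, such an extrapolation can only be expected to give an order-of-magnitude estimate.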

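Let us turn now to calculate the probability $P_+$ that the sample mean $\eta$ is greater than
the population mean $\nu$ - i.e., the probability of the event $\{\eta > \nu\}$. Since $\eta > \nu$ is
equivalent to $z > m/M$, changing variables from $z$ to $R$ gives

$$P_+ = \int_{m/M}^{1} p_z(z)\, dz = \int_0^{R_0} p_R(R)\, dR$$

with $R_0 = (M/m - 1)^{1 - 1/\alpha}$. Further using Eq. (1), together with the fact that $R$ and
$1/R$ are identically distributed, we obtain that

$$P_+ \simeq \frac{1}{\alpha \Gamma(\alpha) \Gamma(1-\alpha)} \left(\frac{m}{M}\right)^{1-\alpha} \tag{6}$$

(equation (6) holding for all $m \ll M$). Clearly, the probability $P_+$ is very small.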
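
The smallness of $P_+$ can be probed by simulation. The sketch below compares a Monte Carlo estimate of the probability that $R < R_0$ with the asymptotic form $P_+ \simeq (m/M)^{1-\alpha} / [\alpha \Gamma(\alpha) \Gamma(1-\alpha)]$; this prefactor is the one assumed in this sketch for Eq. (6). Kanter's representation for the one-sided Levy variables and all function names are likewise assumptions made for illustration.

```python
import math
import random


def one_sided_levy(alpha, rng):
    """One-sided Levy variable (Laplace transform exp(-s**alpha)), Kanter's form."""
    u = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    a = math.sin(alpha * u) / math.sin(u) ** (1.0 / alpha)
    b = (math.sin((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return a * b


def p_plus_asymptotic(alpha, m, M):
    """Assumed asymptotic form of P_+ for m much smaller than M."""
    prefactor = 1.0 / (alpha * math.gamma(alpha) * math.gamma(1.0 - alpha))
    return prefactor * (m / M) ** (1.0 - alpha)


def p_plus_monte_carlo(alpha, m, M, n_samples, rng):
    """Fraction of draws with R < R0, i.e. with eta above nu."""
    r0 = (M / m - 1.0) ** (1.0 - 1.0 / alpha)
    hits = 0
    for _ in range(n_samples):
        r = one_sided_levy(alpha, rng) / one_sided_levy(alpha, rng)
        if r < r0:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    rng = random.Random(3)
    alpha, m, M = 0.5, 10, 1000
    print(p_plus_monte_carlo(alpha, m, M, 20000, rng), p_plus_asymptotic(alpha, m, M))
```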

Thus, in a Levy matchmaking problem, the sample means in different subpopulations not only
fluctuate strongly, but also display a systematic difference. For the same sample size, the
sub-population with smaller $\alpha$ - i.e., the one with a broader distribution - will
typically show a smaller sample mean. The discussion above also gives a possibility to
roughly estimate the unknown population mean $\nu$ from the typically smaller sample mean
$\eta$. Such an extrapolation is given by Eq. (5) (or by Eq. (7) - in the special Levy-Smirnov
case).

It is instructive to consider an analytically solvable example - the Levy-Smirnov case,
corresponding to the exponent value $\alpha = 0.5$. This example is of special interest due to
the fact that the exponent $\alpha = 0.5$ is not too far from the exponent $\alpha \approx 0.6$ obtained
from the distribution of the number of partners in the population of homosexual males. The
Levy-Smirnov pdf of the attractiveness levels is given by

$$p(f) = \frac{1}{2\sqrt{\pi}}\, f^{-3/2} \exp\left(-\frac{1}{4f}\right),$$

for which

$$p_z(z) = \frac{1}{\pi} \sqrt{\frac{x}{z(1-z)}}\; \frac{1}{1 + (x - 1) z}.$$

The quantiles of the corresponding distributions can be calculated explicitly - implying, in
turn, that with probability 0.5 the sample mean $\eta$ lies within the interval

$$(\sqrt{2} - 1)^2\, \frac{m}{M}\, \nu < \eta < (\sqrt{2} + 1)^2\, \frac{m}{M}\, \nu. \tag{7}$$

Namely, the sample mean $\eta$ is typically considerably smaller than the population mean
$\nu$. Only as $m \to M$ does the median $\eta_{1/2}$ converge to the population mean $\nu$. On the
other hand, the distribution over samples is very skewed, and the probability that the
sample mean $\eta$ is greater than the population mean $\nu$ is given by
$P_+ \simeq (2/\pi) \sqrt{m/M}$. Namely, $P_+$ is very small for sample sizes $m$ which are
considerably smaller than the population size $M$.
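
For the Levy-Smirnov case the quantiles can be written in closed form. The sketch below relies on one fact that is an assumption of this illustration rather than a statement of the text: for $\alpha = 1/2$ the variable $\sqrt{R}$ is half-Cauchy distributed, so that the probability of $R \le r$ equals $(2/\pi) \arctan \sqrt{r}$. The function names are hypothetical, and $m < M$ is assumed throughout.

```python
import math


def quartiles_of_R():
    """Lower and upper quartiles of R for alpha = 1/2."""
    return math.tan(math.pi / 8) ** 2, math.tan(3 * math.pi / 8) ** 2


def central_interval(N, M, m):
    """Interval containing eta with probability 0.5, without assuming m << M."""
    x = (M / m - 1.0) ** 2  # x = (M/m - 1)**(1/alpha) with alpha = 1/2
    q_lo, q_hi = quartiles_of_R()
    return (N / m) / (1.0 + x * q_hi), (N / m) / (1.0 + x * q_lo)


def p_plus_exact(M, m):
    """Probability that eta exceeds nu, from the half-Cauchy law of sqrt(R)."""
    r0 = 1.0 / (M / m - 1.0)  # R0 = (M/m - 1)**(1 - 1/alpha) with alpha = 1/2
    return (2.0 / math.pi) * math.atan(math.sqrt(r0))


if __name__ == "__main__":
    N, M, m = 10**5, 10**4, 10
    nu = N / M
    lo, hi = central_interval(N, M, m)
    # Compare with the small-sample forms (sqrt(2) -/+ 1)**2 * (m/M) * nu
    print(lo, (math.sqrt(2) - 1) ** 2 * (m / M) * nu)
    print(hi, (math.sqrt(2) + 1) ** 2 * (m / M) * nu)
    print(p_plus_exact(M, m), (2.0 / math.pi) * math.sqrt(m / M))
```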

This "anomalous behavior" is typical in the cases of power-law distributions with divergent
mathematical expectation: $P(n) \sim n^{-1-\alpha}$ with $0 < \alpha < 1$. For exponents in the range
$\alpha > 1$ the sample mean shows no systematic shift and fluctuates around the population
mean. Specifically [13]: In the range $1 < \alpha < 2$ the fluctuations are Levy-distributed, and
of the order $O(m^{1/\alpha - 1})$. And, in the range $\alpha > 2$ these fluctuations are Normally
distributed, and of the order $O(1/\sqrt{m})$.

We considered the problem of sampling from a 
naturally-truncated power-law distribution, and the 
problem of matching two populations with different 
naturally-truncated power-law distributions sharing the 
same population mean. We have shown that the sample 
means - in case of sample sizes which are considerably 
smaller than the population size - fluctuate strongly and 
display systematic deviations from the population mean. 
Since the dependence of this systematic deviation on the 
number of sampled elements is known, this can be used 
to obtain a rough estimation of the population mean. 